A Novel and Noninvasive Risk Assessment Score and Its Child-to-Adult Trajectories to Screen Subclinical Renal Damage in Middle Age

This study aimed to develop a noninvasive, economical and effective subclinical renal damage (SRD) risk assessment tool to identify high-risk asymptomatic people from a large-scale population and improve current clinical SRD screening strategies. Based on the Hanzhong Adolescent Hypertension Cohort, SRD-associated variables were identified, and an SRD risk assessment score model was established and further validated with machine learning algorithms. Longitudinal follow-up data were used to identify child-to-adult SRD risk score trajectories and to investigate the relationship between different trajectory groups and the incidence of SRD in middle age. Systolic blood pressure, diastolic blood pressure and body mass index were identified as SRD-associated variables. Based on these three variables, an SRD risk assessment score was developed, with excellent classification ability (AUC value of ROC curve: 0.778 for SRD estimation, 0.729 for 4-year SRD risk prediction), calibration (Hosmer-Lemeshow goodness-of-fit test p = 0.62 for SRD estimation, p = 0.34 for 4-year SRD risk prediction) and greater potential clinical benefit. In addition, three child-to-adult SRD risk assessment score trajectories were identified: increasing, increasing-stable and stable. Further difference analysis and logistic regression analysis showed that these SRD risk assessment score trajectories were highly associated with the incidence of SRD in middle age. In brief, we constructed a novel and noninvasive SRD risk assessment tool with excellent performance to help identify high-risk asymptomatic people from a large-scale population and assist in SRD screening.


Introduction
Chronic kidney disease (CKD) is defined as abnormalities in kidney structure or function for at least 3 months with implications for health [1]. CKD has become a major public health concern due to its high prevalence and all-cause mortality [2,3]. The Global Burden of Disease Study reported that 697.5 million individuals suffered from CKD in 2017, with an overall prevalence of 9.1% [4]. A systematic review on the regional prevalence of CKD in Asia showed a substantial variation in CKD prevalence ranging from 7.0% in South Korea to 34.3% in Singapore, while China and India had the highest absolute number of people with CKD (159.8 million and 140.2 million, respectively) [5]. CKD is associated with a high risk of hospitalization, cardiovascular events, cognitive dysfunction, morbidity and all-cause mortality [6][7][8]. In addition, CKD may be accompanied by several other complications, including anemia, secondary hyperparathyroidism and electrolyte disturbances, creating substantial health care costs [9][10][11] and indicating the urgent need service mainly contributed to the loss of follow-up. This study was clinically registered (NCT02734472) and approved by the Ethics Committee of First Affiliated Hospital of Xi'an Jiaotong University (Ethical Approval number: XJTU1AF2015LSL-047). All subjects gave written informed consent in advance. In addition, we obtained the consent of a parent/guardian for participants <18 years of age.

Anthropometric Measurements
Baseline clinical information, including demographic characteristics, histories of hypertension, hyperlipidemia, stroke and diabetes, history of cigarette smoking and alcohol consumption and cardiovascular complications, was collected using a standardized self-administered questionnaire. Body weight, height, waist circumference and hip circumference were measured by trained staff via standardized procedures. Body mass index (BMI) was calculated as weight in kilograms divided by height in meters squared (kilograms per meter squared). The average values of replicate measurements were used for further analysis.

Blood Pressure Measurements
Systolic and diastolic blood pressure were measured three times by trained and certified staff following WHO-recommended procedures (seated position in a quiet and comfortable environment, 5-min rest before measurement, 2-min interval between measurements). Mean blood pressure values were used for further analysis.

Definitions
In this study, subclinical renal damage was defined as an eGFR between 30 and 60 mL/min/1.73 m² or a uACR of more than 2.5 mg/mmol in men and 3.5 mg/mmol in women [15]. Cigarette smokers were defined as subjects with more than six months of smoking history during their lifetime (continuous or cumulative) [29]. Participants who reported that they drank alcohol (liquor, beer or wine) every day and that their alcohol consumption had lasted for more than 6 months were defined as drinkers [30].
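The SRD definition above can be expressed as a small rule. The following is a minimal sketch rather than the authors' code: the function name `has_srd`, the sex labels and the inclusive handling of the eGFR boundaries are assumptions made for illustration.

```python
def has_srd(egfr, uacr, sex):
    """Return True if the SRD definition quoted in the text is met.

    egfr: estimated GFR in mL/min/1.73 m^2
    uacr: urinary albumin-to-creatinine ratio in mg/mmol
    sex:  "male" or "female" (hypothetical labels)
    """
    # Sex-specific uACR thresholds taken from the Definitions section.
    uacr_threshold = {"male": 2.5, "female": 3.5}[sex]
    # Treating both eGFR bounds as inclusive is an assumption;
    # the text only says "between 30 and 60".
    reduced_egfr = 30 <= egfr <= 60
    raised_uacr = uacr > uacr_threshold
    return reduced_egfr or raised_uacr
```

For instance, a uACR of 3.0 mg/mmol with a normal eGFR would meet the definition for a man but not for a woman, because the threshold is sex-specific.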

Statistical Analysis
To identify effective and reliable clinical parameters with high screening or early diagnostic value for SRD, we analyzed the cross-sectional data from 2017 (n = 2303) and applied a novel feature selection strategy that combined three machine learning methods (complete-case analyses): LASSO regression, random forest and the SVM-RFE algorithm. LASSO regression was performed via the R package "glmnet" [31], the random forest method was carried out with the R package "randomForest" and the SVM-RFE approach was implemented with the R packages "sigFeature" and "e1071". A logistic regression model was constructed with the R package "rms". The 2303 participants were randomly assigned to the training set (70%, n = 1611) and the internal validation set (30%, n = 692). The R package "pROC" was used to calculate the area under the curve (AUC) value of the receiver operating characteristic (ROC) curve [32]. In addition, calibration curve analysis and the Hosmer-Lemeshow goodness-of-fit test were performed using the R packages "rms" and "ResourceSelection". Decision curve analysis was conducted with the R package "rmda" to evaluate the potential clinical application value and net benefit.
Next, group-based trajectory modeling was achieved by the "traj" package [33] in R software to identify the optimal number of subgroups with similar SRD risk score trajectories among those with complete blood pressure and BMI data during the 30-year follow-up in this cohort (n = 1048, complete-case analysis). Categorical data are summarized as frequencies and percentages. Continuous variables are reported as the mean ± standard deviation (if normally distributed) or the median (25th and 75th percentile ranges). Independent sample t-tests, one-way ANOVA, Mann-Whitney U tests and Kruskal-Wallis tests were performed for the difference analysis of continuous variables according to their group, distribution and variance. Logistic regression analysis was carried out by SPSS software (SPSS Inc., Chicago, IL, USA). Statistical significance was considered at a two-sided p value <0.05 for all analyses.
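To make the classification metrics concrete, the sketch below computes an AUC as the probability that a randomly chosen case scores higher than a randomly chosen non-case, together with sensitivity and specificity at a given cutoff. It is an illustration in Python, not the pROC-based analysis used in the study; the function names and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """AUC via pairwise comparison of cases (label 1) and non-cases (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # case ranked above non-case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


def sens_spec(scores, labels, cutoff):
    """Sensitivity and specificity when score >= cutoff is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data, not taken from the cohort.
toy_scores = [0.05, 0.10, 0.12, 0.20, 0.30, 0.40]
toy_labels = [0, 0, 1, 0, 1, 1]
toy_auc = auc(toy_scores, toy_labels)
toy_sens, toy_spec = sens_spec(toy_scores, toy_labels, cutoff=0.153)
```

The pairwise loop is quadratic in sample size, which is acceptable for a sketch; rank-based implementations are preferable for large cohorts.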

Study Population
The flow chart of the present study is shown in Figure 1. Overall, the latest follow-up data (the 7th follow-up, in 2017) of 2303 participants were included in the cross-sectional analysis to perform the machine learning feature selection and identify variables highly associated with SRD. Then, these 2303 participants were randomly assigned to the training set (70%, n = 1611) and the internal validation set (30%, n = 692). The training set was used to construct the SRD risk score model and the validation set was used to evaluate the SRD estimation performance. The data from 2013 (the 6th follow-up) were also included to evaluate the 4-year SRD risk prediction performance. The characteristics of the participants in the training and internal validation sets that were used in model construction and validation are shown in Table 1. No variable differed significantly between the training and internal validation sets, which suggests that the data were consistent and the grouping was reasonable. In addition, participants with complete blood pressure and BMI data during the 30-year follow-up were included in further group-based trajectory modeling analysis to identify the SRD risk score trajectories (n = 1048).

Feature Selection
A heatmap (Figure S1 in Supplementary Materials) showed the correlation among SRD and 25 other SRD-associated variables (anthropometric parameters, blood pressure level, biochemical parameters, diabetes history, etc.). Considering the multicollinearity in the data, it is necessary to conduct feature selection to identify the most important variables before constructing SRD risk models. In this study, we combined three machine learning algorithms to achieve accurate feature selection: LASSO regression analysis, the random forest algorithm and the SVM-RFE algorithm. In LASSO regression analysis, 10-fold cross-validation was performed to detect the optimal AUC value and minimal parameters. We selected six features among the 25 variables: systolic blood pressure, diastolic blood pressure, BMI, triglyceride, heart rate and diabetes (Figure 2A). The SVM-RFE algorithm was also used to achieve feature selection according to the optimal classification accuracy. Four variables were identified as key features: diastolic blood pressure, systolic blood pressure, BMI and body weight (Figure 2B). In addition, the random forest algorithm suggested six features (diastolic blood pressure, systolic blood pressure, BMI, triglyceride, serum chloride and serum potassium) to reach the minimum cross-validation error (Figure 2C). Meanwhile, the importance of the variables in the random forest model was calculated based on the mean decrease in the Gini coefficient (Figure 2D). Finally, by combining these three machine learning feature selection algorithms, we selected diastolic blood pressure, systolic blood pressure and BMI as hub variables for further analysis and model construction.
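The text does not state the exact rule used to combine the three algorithms. One reading that reproduces the reported result is a simple intersection of the three selected feature sets, sketched below; the variable names are illustrative and the intersection rule is an assumption.

```python
from collections import Counter

# Feature sets as reported for each algorithm in the text above.
lasso = {"SBP", "DBP", "BMI", "TG", "heart rate", "diabetes"}
svm_rfe = {"DBP", "SBP", "BMI", "body weight"}
random_forest = {"DBP", "SBP", "BMI", "TG", "serum chloride", "serum potassium"}

# Assumed combination rule: keep features chosen by all three methods.
consensus = lasso & svm_rfe & random_forest
print(sorted(consensus))  # ['BMI', 'DBP', 'SBP']

# Vote counts show how close other features came to being selected.
votes = Counter(f for s in (lasso, svm_rfe, random_forest) for f in s)
```

Under this rule triglyceride, chosen by two of the three methods, is excluded, which is consistent with the three-variable model described in the text.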
Bioengineering 2023, 10, x FOR PEER REVIEW
Figure 2. (D) Importance of the parameters was assessed by a random forest algorithm. AUC, area under the curve; RMSE, root mean square error; mDBP, mean diastolic blood pressure; mSBP, mean systolic blood pressure; BMI, body mass index; TG, triglyceride; Cls, serum chloride ion; Ks, serum potassium; Nas, serum sodium; WHR, weight-to-height ratio; TBil, total bilirubin; UA, uric acid; HR, heart rate; TC, total cholesterol.

Construction and Validation of the SRD Risk Assessment Model
Logistic regression analysis was performed to establish an SRD risk assessment model based on data from the training set:
\[ \text{SRD index} = 0.020143 \times \text{SBP} + 0.039718 \times \text{DBP} + 0.063076 \times \text{BMI} - 9.211994 \]
\[ \text{SRD risk score} = \frac{1}{1 + e^{-\text{SRD index}}} \]
Meanwhile, a corresponding nomogram was constructed to achieve more efficient clinical application (Figure 3A). In detail, total points can be calculated from the SBP, DBP and BMI data to evaluate the diagnostic probability of SRD. A high probability indicates the need for further blood or urine testing to determine renal function, while a low probability indicates little need for further tests, which enables large-scale screening or self-monitoring. Next, we validated the classification ability of the model, and the AUC value of the ROC curve reached 0.778 (for SRD real-time estimation) and 0.729 (for 4-year SRD risk prediction) in the internal validation set (Figure 3B,C). The optimal cutoff value for SRD real-time estimation was 0.153, which gave a sensitivity of 0.685 and a specificity of 0.779. Meanwhile, the optimal cutoff value for 4-year SRD risk prediction was 0.117, which gave a sensitivity of 0.767 and a specificity of 0.598. The calibration curve analysis and the Hosmer-Lemeshow goodness-of-fit test (p = 0.62 for SRD real-time estimation, p = 0.34 for SRD 4-year risk prediction) indicated that this model had good calibration in both SRD real-time estimation and SRD 4-year risk prediction (Figure 3D,E). In addition, the SRD estimation decision curve analysis (DCA) showed that using this SRD assessment model to assist in SRD screening decisions yields more potential net benefit across all risk thresholds than the SRD screening strategies currently used in clinical practice, which mainly focus on specific higher-risk conditions such as hypertension, obesity and diabetes (Figure 3F,G). The results of the SRD 4-year risk prediction DCA also supported this conclusion.
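For readers unfamiliar with decision curve analysis, net benefit at a risk threshold is conventionally defined as the true-positive fraction minus the false-positive fraction weighted by the odds of the threshold. The sketch below shows that standard definition; the function names and the counts in the example are hypothetical and are not taken from the study.

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit of a screening rule at a given risk threshold.

    tp, fp: true and false positives among n people screened.
    """
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(prevalence, threshold):
    """Net benefit of referring everyone for further testing."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical example: 1000 people, 80 true positives, 200 false positives.
nb_model = net_benefit(tp=80, fp=200, n=1000, threshold=0.2)
nb_all = net_benefit_treat_all(prevalence=0.12, threshold=0.2)
```

A decision curve plots these quantities over a range of thresholds; a model is useful at a threshold where its net benefit exceeds both the refer-everyone and the refer-no-one (net benefit of zero) strategies.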
In fact, SBP, DBP and BMI data are easy to collect in clinical practice by noninvasive examination, which indicates that it is possible for our models to evaluate or predict the SRD risk and identify high-risk asymptomatic people from a large-scale population, which can improve existing SRD screening strategies.
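The risk score can be computed directly from the published coefficients. The following is a minimal sketch, assuming SBP and DBP are entered in mmHg and BMI in kg/m² (the units are not restated next to the formula); the function names and example inputs are hypothetical.

```python
import math

# Coefficients and cutoffs as reported in the text.
COEF_SBP = 0.020143
COEF_DBP = 0.039718
COEF_BMI = 0.063076
INTERCEPT = -9.211994
CUTOFF_ESTIMATION = 0.153   # real-time SRD estimation
CUTOFF_4_YEAR = 0.117       # 4-year SRD risk prediction


def srd_index(sbp, dbp, bmi):
    """Linear predictor of the logistic model (units assumed: mmHg, kg/m^2)."""
    return COEF_SBP * sbp + COEF_DBP * dbp + COEF_BMI * bmi + INTERCEPT


def srd_risk_score(sbp, dbp, bmi):
    """Logistic transform of the SRD index, giving a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-srd_index(sbp, dbp, bmi)))


# Hypothetical individuals, for illustration only.
low = srd_risk_score(sbp=120, dbp=80, bmi=24)    # roughly 0.11
high = srd_risk_score(sbp=150, dbp=95, bmi=30)   # roughly 0.37
flag_high = high >= CUTOFF_ESTIMATION
```

In this illustration the first individual falls below both reported cutoffs, while the second exceeds the real-time estimation cutoff and would be a candidate for blood or urine testing.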

Figure 3. (caption truncated) ... assessment model, which showed that this model had greater potential clinical benefits than each individual variable used to assess SRD risk in current clinical practice, such as hypertension, diabetes and BMI. SRD, subclinical renal damage; SBP, systolic blood pressure; DBP, diastolic blood pressure; BMI, body mass index; AUC, area under the curve; DCA, decision curve analysis.

SRD Risk Score Trajectory
SRD risk scores during the 30-year follow-up were calculated based on the diastolic blood pressure, systolic blood pressure and BMI data. Then, we performed group-based trajectory modeling analysis and identified three SRD risk score trajectory groups: stable, increasing-stable and increasing (Figure 4). The SRD risk scores of all three groups increased with age from childhood to middle age, with similar slopes before about 25 years of age. After this age, the stable group (n = 376; 35.9%) maintained relatively lower SRD risk scores than the other two groups, whose scores continued to increase. The increasing-stable group (n = 404; 38.5%) was characterized by SRD risk scores that rose to a relatively higher level and then held steady after about 40 years of age. Meanwhile, the increasing group (n = 268; 25.6%) was characterized by a sustained increase from childhood to middle age and reached a higher level than both the stable group and the increasing-stable group.
Figure 4. Three SRD risk score trajectory groups identified in this study using group-based trajectory modeling analysis: stable group, increasing-stable group and increasing group. SRD, subclinical renal damage.
Table 2 shows selected anthropometric and biochemical measurements in 1987 and 2017 according to these three SRD risk score groups. Among these 1048 participants, 583 (55.6%) were males and 465 (44.4%) were females. The median age in 2017 was 43 years. Differences in the proportion of males, age, incidence of hyperlipidemia, incidence of hypertension, current smoking, alcohol consumption, waist circumference, hip circumference, TC, TG, LDL-C, HDL-C, serum uric acid, serum creatinine, urine albumin and uACR were statistically significant (p < 0.05). Occupation, education, marital status, incidence of carotid atherosclerosis, heart rate (both in 1987 and in 2007), urine uric acid (uUA) and eGFR were not significantly different. Individuals in the stable group were more likely to be female and to have a lower waist circumference, hip circumference, TC, TG, LDL-C and serum UA. In addition, the increasing group had a higher incidence of hyperlipidemia and hypertension, as well as higher rates of current smoking and alcohol consumption.

Association between Novel SRD Risk Score Trajectories and Subclinical Renal Damage
SRD incidence was significantly different among the three SRD risk score groups (p < 0.05). Figure 5A shows that the increasing group had a higher SRD incidence in middle age (19%) than the stable group (8.8%) and the increasing-stable group (13.4%). The uACR was significantly different among the three SRD risk score groups (p < 0.05), whereas the eGFR was not (p = 0.26). The increasing group had a significantly higher uACR level (1.25 (0.74-2.34)) than the increasing-stable group (0.99 (0.64-1.96)) and the stable group (0.85 (0.57-1.33)). Pairwise comparisons of uACR levels were also significant (p = 0.002 for the stable group compared to the increasing-stable group, p < 0.001 for the stable group compared to the increasing group, p = 0.011 for the increasing-stable group compared to the increasing group). Moreover, the increasing group had a numerically lower eGFR (94.3 (85.9-106.0)) than the stable group (97.2 (86.2-107.0)) and the increasing-stable group (97.7 (87.1-106.3)). The scatter diagrams of uACR levels and eGFR levels among these three groups are shown in Figure 5B,C. Next, logistic regression was performed to investigate the association between the SRD risk score trajectory groups and SRD incidence. The trajectory groups were entered as dummy independent variables, with the stable group as the reference group. Our results showed that the increasing group and the increasing-stable group had significantly greater odds of SRD incidence in middle age than the stable group. The increasing-stable group had an OR of 1.6 (95% CI, 1.01 to 2.54), and the increasing group had an OR of 2.44 (95% CI, 1.53 to 3.91). The adjusted logistic regression model showed that the ORs were slightly attenuated after adjustment for gender and age.
The increasing-stable group had an OR of 1.53 (95% CI, 0.96 to 2.43), and the increasing group had an OR of 2.39 (95% CI, 1.49 to 3.84). Additional adjustment for waist circumference, hip circumference, TC, TG, LDL-C and HDL-C further attenuated the ORs: the increasing-stable group had an OR of 1.25 (95% CI, 0.77 to 2.05), and the increasing group had an OR of 1.75 (95% CI, 1.05 to 2.91). Finally, after further adjustment for current smoking and alcohol consumption, the OR of the increasing-stable group was 1.24 (95% CI, 0.76 to 2.03) and the OR of the increasing group was 1.73 (95% CI, 1.04 to 2.89). These results indicated that these SRD risk score trajectories can serve as a strong predictor of SRD incidence risk in middle age (Table 3). In addition, the long-term trajectory analysis demonstrates the good performance and reliability of this SRD risk assessment score in longitudinal observation.
Figure 5. (B,C) Scatter diagrams of eGFR levels and uACR levels among these three SRD risk score trajectory groups. SRD, subclinical renal damage; eGFR, estimated glomerular filtration rate; uACR, urinary albumin-to-creatinine ratio. # p < 0.05 vs. stable group and $ p < 0.05 vs.
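The unadjusted odds ratios reported in this section can be approximately reproduced from the group sizes and incidence percentages. The sketch below uses a Wald confidence interval; the case counts (33, 54 and 51) are back-calculated from the rounded percentages and are therefore an assumption, not figures stated in the text, and the function name is hypothetical.

```python
import math


def odds_ratio_wald(cases_exp, n_exp, cases_ref, n_ref, z=1.96):
    """Crude odds ratio with a Wald 95% CI against a reference group."""
    a, b = cases_exp, n_exp - cases_exp    # exposed: cases, non-cases
    c, d = cases_ref, n_ref - cases_ref    # reference: cases, non-cases
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# ASSUMPTION: case counts inferred from 8.8% of 376, 13.4% of 404, 19% of 268.
stable_cases, stable_n = 33, 376
inc_stable_cases, inc_stable_n = 54, 404
increasing_cases, increasing_n = 51, 268

or_inc_stable = odds_ratio_wald(inc_stable_cases, inc_stable_n,
                                stable_cases, stable_n)
or_increasing = odds_ratio_wald(increasing_cases, increasing_n,
                                stable_cases, stable_n)
```

With these inferred counts the crude estimates come out close to the reported unadjusted values (about 1.6 and 2.44). The adjusted ORs cannot be reproduced this way because they require individual-level covariates.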

Main Findings
Three predictive factors (SBP, DBP and BMI) for SRD in middle age were identified using an integrated feature selection strategy. Based on these three predictive factors, a novel noninvasive SRD risk assessment model was established that showed excellent classification ability, calibration and potential clinical benefits for SRD estimation and SRD 4-year risk prediction. These results indicated that our models can identify high-risk asymptomatic people from a large-scale population and support early clinical SRD screening decisions in middle age. Additionally, through subsequent cohort analysis, we identified three trajectory groups for this novel SRD risk assessment score using 30-year follow-up data. We found that the incidence of SRD in middle age and uACR levels were highly associated with these risk score trajectories. Further logistic regression analysis indicated that these SRD risk score trajectories can serve as a strong predictor of SRD incidence risk in middle age. Therefore, longitudinal observation further confirmed the value of this risk score for generating individualized risk estimates and informing clinical screening decisions for SRD in middle age. In summary, we constructed a novel, simple and low-cost risk assessment tool for SRD screening, which showed good performance in predicting SRD risk in middle age. The convenience of this model makes it possible to assess the SRD risk of asymptomatic people and then carry out further SRD screening.

Prior Studies and the Focus of our Investigation
Detection of and screening for SRD are critical because SRD can correspond to CKD stages (G3a and G3b in the GFR category; A2 and A3 in the persistent albuminuria category) that are associated with a moderately increased risk (yellow risk) or high risk (orange risk) of concurrent complications and adverse future outcomes; these are also the most critical periods for early diagnosis and intervention in CKD. However, SRD is usually asymptomatic until an advanced disease stage, and methods for estimating renal function, such as the measurement of serum creatinine concentration or urine protein or albumin concentration, are costly for long-term follow-up or large-scale screening [34,35]. In current clinical practice, only patients with specific higher-risk conditions, such as hypertension, obesity and diabetes, are recommended to be screened for renal function or SRD. It is still difficult to apply routine SRD screening in a large-scale general population, especially for asymptomatic adults, due to the lack of a more economical and effective noninvasive risk assessment tool for SRD [1,36]. Therefore, a simple and noninvasive SRD risk assessment tool is urgently needed to assist in SRD screening decisions and improve large-scale SRD screening strategies. SRD is attributed to several risk factors, such as hypertension, diabetes, older age and obesity [37][38][39]. There have been numerous efforts to construct prediction models for the risk of decreasing eGFR in CKD [22,24]. However, the estimation or prediction of SRD can be more useful than only predicting a decrease in eGFR from the perspective of identifying the prognostic risk of CKD. In addition, existing models included too many variables and biochemical examination results, which complicated their translation to clinical practice for large-scale screening.
Hence, in this study, we provided a novel feature-selection strategy by combining three machine learning methods, and first established an SRD risk assessment model calculated only by SBP, DBP and BMI data, which may have greater utility in clinical application. Additionally, our risk assessment model had better performance than those in previous studies: excellent classification ability (AUC value of the ROC curve: 0.778 for SRD estimation, 0.729 for 4-year SRD risk prediction in the validation set), calibration (Hosmer-Lemeshow goodness-of-fit test p = 0.62 for SRD estimation, p = 0.34 for 4-year SRD risk prediction) and potential clinical benefits.
In addition, most existing prediction models lack a longitudinal cohort analysis, such as group-based trajectory modeling analysis, which could reflect the relationship between model trajectory and SRD incidence [40,41]. Therefore, in the current study, we combined SBP, DBP and BMI data to calculate a novel SRD risk assessment score and then performed a trajectory analysis. Ultimately, three trajectory groups (increasing, increasing-stable and stable) were identified based on 30-year follow-up data, and the incidence of SRD in middle age and uACR levels were highly associated with these risk score trajectories. Compared with the stable group, the increasing group and increasing-stable group had a significantly higher uACR. In addition, the results of the logistic regression showed that these three SRD risk assessment score trajectories could serve as ideal predictors of the incidence of SRD in middle age. Several other studies and some of our previous works have tried to investigate the relationship between SRD incidence and its risk-factor trajectories, such as SBP trajectory, DBP trajectory, MAP trajectory and BMI trajectory [13,15]. However, single-variable trajectory analyses have limitations because they ignore the interaction among multiple factors [42]. Hence, the group-based trajectory analysis for the SRD risk assessment score in the current work, which gives full consideration to the characteristics of SBP, DBP and BMI, is also a breakthrough for SRD-associated trajectory modeling analysis strategies.

Limitations and Future Directions
The present study used a community-based cohort followed for 30 years, which represents a large population. It is prospective in nature and consists of representative data from the general population. However, this study has the following limitations. First, our study used a racially homogeneous cohort from multiple rural areas in northern China, which limits the generalizability of our results; validation in other cohorts with different ethnic and population backgrounds will be performed in our further studies. Second, this work was not externally validated, which may also limit the generalizability of our results. Notwithstanding these limitations, our study provides a novel SRD risk assessment tool that combines good performance in both cross-sectional and longitudinal analysis with convenience of clinical application. In addition, to our knowledge, this is the first study to perform a group-based trajectory modeling longitudinal analysis for an SRD risk assessment tool, which revealed that actual SRD outcomes in middle age correspond to the trend of the risk score suggested by the SRD risk assessment model.

Conclusions
In conclusion, we used a large community-based cohort followed for 30 years to establish a novel, simple and low-cost SRD risk assessment tool and performed longitudinal group-based trajectory analysis for this tool. Internal validation suggested that our risk assessment model has excellent classification ability (AUC value of the ROC curve: 0.778 for SRD estimation, 0.729 for 4-year SRD risk prediction), calibration (Hosmer-Lemeshow goodness-of-fit test p = 0.62 for SRD estimation, p = 0.34 for 4-year SRD risk prediction) and potential clinical benefits. Further longitudinal trajectory analysis also confirmed the reliability of this SRD risk assessment score. Considering the good clinical utility, simplicity and convenience as well as the excellent performance of our model, it can identify high-risk asymptomatic people from a large-scale population and improve current clinical SRD screening strategies.